$$
\frac{\partial\,\mathrm{bool}(x)}{\partial x} =
\begin{cases}
1, & \text{if } |x| \le 1 \\
0, & \text{otherwise.}
\end{cases}
\tag{5.29}
$$
By applying the bool(·) function, the elements of the attention weight with lower values are binarized to 0, so the resulting entropy-maximized attention weight can filter out the crucial elements. The proposed Bi-Attention structure is finally expressed as
$$
\mathbf{B}_A = \mathrm{bool}(\mathbf{A}) = \mathrm{bool}\!\left(\frac{1}{\sqrt{D}}\,\mathbf{B}_Q \otimes \mathbf{B}_K^{\top}\right),
\tag{5.30}
$$
$$
\text{Bi-Attention}(\mathbf{B}_Q, \mathbf{B}_K, \mathbf{B}_V) = \mathbf{B}_A \boxtimes \mathbf{B}_V,
\tag{5.31}
$$
where BV is the binarized value obtained by sign(V), BA is the binarized attention weight, and ⊠ is a well-designed Bitwise-Affine Matrix Multiplication (BAMM) operator composed of ⊗ and a bitshift, which aligns the training and inference representations and performs efficient bitwise computation.
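To make the structure concrete, below is a minimal PyTorch sketch of the bool(·) binarization with the clipped straight-through gradient of Eq. (5.29) and the Bi-Attention forward pass of Eqs. (5.30)–(5.31). It is illustrative rather than the authors' implementation: the thresholding of bool(·) at zero in the forward pass, the names BoolSTE, sign_ste, and bi_attention, and the replacement of the BAMM operator ⊠ by an ordinary matrix multiplication are all assumptions made here for clarity.

```python
import torch


class BoolSTE(torch.autograd.Function):
    """bool(x): 1 if x >= 0 else 0 (threshold at zero assumed here),
    with the clipped straight-through gradient of Eq. (5.29):
    pass the incoming gradient only where |x| <= 1."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x >= 0).to(x.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


def sign_ste(x):
    # sign(·) binarization to ±1 with a plain straight-through gradient
    # (kept unclipped here for brevity).
    b = torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))
    return (x - x.detach()) + b.detach()


def bi_attention(q, k, v):
    """Bi-Attention of Eqs. (5.30)-(5.31) on binarized Q, K, V.

    The BAMM operator (⊠) is approximated by a plain matmul for
    illustration; in BiBERT it is realized with bitwise operations
    and a bitshift.
    """
    d = q.shape[-1]
    bq, bk, bv = sign_ste(q), sign_ste(k), sign_ste(v)
    a = bq @ bk.transpose(-1, -2) / d ** 0.5   # A = (1/sqrt(D)) BQ ⊗ BK^T
    ba = BoolSTE.apply(a)                      # binary attention weight, no softmax
    return ba @ bv                             # BA ⊠ BV (approximated)
```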
In a nutshell, the Bi-Attention structure maximizes the information entropy of the binarized attention weight (as Fig. 5.14(c) shows) to alleviate its immense information degradation and revive the attention mechanism. Bi-Attention also achieves greater efficiency since the softmax is excluded.
5.9.2 Direction-Matching Distillation
As an optimization technique based on element-level comparison of activations, distillation allows the binarized BERT to mimic the full-precision teacher model's intermediate activations. However, distillation causes a direction mismatch in the optimization of the fully binarized BERT baseline, leading to insufficient optimization and even harmful effects. To address the direction mismatch that occurs in the backward propagation of the fully binarized BERT baseline, the authors further proposed a Direction-Matching Distillation (DMD) scheme with apposite distilled activations and well-constructed similarity matrices to effectively utilize knowledge from the teacher, which optimizes the fully binarized BERT more accurately.
Their efforts first fall on reselecting the distilled activations for DMD: they distill the upstream query Q and key K instead of the attention score, which utilizes its knowledge while alleviating the direction mismatch. Besides, the authors also distill the value V to further cover all the inputs of the MHA. Then, similarity pattern matrices are constructed from the distilled activations, which can be expressed as
$$
\mathbf{P}_Q = \frac{\mathbf{Q} \times \mathbf{Q}^{\top}}{\|\mathbf{Q} \times \mathbf{Q}^{\top}\|}, \quad
\mathbf{P}_K = \frac{\mathbf{K} \times \mathbf{K}^{\top}}{\|\mathbf{K} \times \mathbf{K}^{\top}\|}, \quad
\mathbf{P}_V = \frac{\mathbf{V} \times \mathbf{V}^{\top}}{\|\mathbf{V} \times \mathbf{V}^{\top}\|},
\tag{5.32}
$$
where ∥·∥ denotes ℓ2 normalization. The corresponding PQT, PKT, PVT are constructed in the same way from the teacher's activations. The distillation loss is expressed as
$$
\ell_{\mathrm{distill}} = \ell_{\mathrm{DMD}} + \ell_{\mathrm{hid}} + \ell_{\mathrm{pred}},
\tag{5.33}
$$
$$
\ell_{\mathrm{DMD}} = \sum_{l \in [1, L]} \sum_{F \in \mathcal{F}_{\mathrm{DMD}}} \left\| \mathbf{P}_F^{\,l} - \mathbf{P}_{F_T}^{\,l} \right\|,
\tag{5.34}
$$
where L denotes the number of transformer layers and FDMD = {Q, K, V}. The loss term ℓhid is constructed in the same ℓ2-normalized form.
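The following PyTorch sketch mirrors Eqs. (5.32)–(5.34): it builds the ℓ2-normalized similarity pattern matrices from the student's and teacher's Q, K, V activations in each layer and sums their distances. The function names similarity_pattern and dmd_loss, the list-of-dicts input layout, and the use of the Frobenius norm for ∥·∥ in Eq. (5.34) are assumptions for illustration, not the authors' code.

```python
import torch


def similarity_pattern(x):
    """P = (X X^T) / ||X X^T||, Eq. (5.32); x has shape (seq_len, dim)."""
    s = x @ x.transpose(-1, -2)
    return s / s.norm()  # ℓ2 (Frobenius) normalization, assumed here


def dmd_loss(student_qkv, teacher_qkv):
    """ℓ_DMD of Eq. (5.34): sum over layers l and F ∈ {Q, K, V} of
    ||P_F^l - P_{F_T}^l||.

    student_qkv / teacher_qkv: lists (one entry per transformer layer)
    of dicts {"Q": ..., "K": ..., "V": ...} holding the distilled
    activations (hypothetical layout).
    """
    loss = 0.0
    for s_layer, t_layer in zip(student_qkv, teacher_qkv):
        for name in ("Q", "K", "V"):
            p_s = similarity_pattern(s_layer[name])
            p_t = similarity_pattern(t_layer[name])
            loss = loss + (p_s - p_t).norm()
    return loss
```

The overall ℓdistill of Eq. (5.33) would then add the hidden-state term ℓhid and the prediction term ℓpred on top of this quantity.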
The overall pipeline of BiBERT is shown in Fig. 5.15. The authors conducted experiments on the GLUE benchmark by binarizing various BERT-based pre-trained models. The results listed in Table 5.7 show that BiBERT surpasses BinaryBERT by a wide margin in average accuracy.